{
  "nbformat": 4,
  "nbformat_minor": 5,
  "metadata": {
    "colab": {
      "name": "Inverse_Text_Normalization.ipynb",
      "provenance": [],
      "collapsed_sections": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.7.8"
    }
  },
  "cells": [
    {
      "cell_type": "code",
      "metadata": {
        "id": "U1GACXvL5GhV"
      },
      "source": [
        "if 'google.colab' in str(get_ipython()):\n",
        "  !pip install -q condacolab\n",
        "  import condacolab\n",
        "  condacolab.install()"
      ],
      "id": "U1GACXvL5GhV",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "YxVLI-f97Kxl"
      },
      "source": [
        "\"\"\"\n",
        "You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n",
        "\n",
        "Instructions for setting up Colab are as follows:\n",
        "1. Open a new Python 3 notebook.\n",
        "2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n",
        "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n",
        "\"\"\"\n",
        "\n",
        "BRANCH = 'v1.0.0'"
      ],
      "id": "YxVLI-f97Kxl",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "TcWLxxXC7Jgx"
      },
      "source": [
        "\n",
        "# If you're using Google Colab and not running locally, run this cell.\n",
        "# install NeMo\n",
        "if 'google.colab' in str(get_ipython()):\n",
        "  !python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]"
      ],
      "id": "TcWLxxXC7Jgx",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "4nf8sui349co"
      },
      "source": [
        "if 'google.colab' in str(get_ipython()):\n",
        "  !conda install -c conda-forge pynini=2.1.3\n",
        "  ! mkdir images\n",
        "  ! wget https://github.com/NVIDIA/NeMo/blob/$BRANCH/tutorials/text_processing/images/deployment.png -O images/deployment.png\n",
        "  ! wget https://github.com/NVIDIA/NeMo/blob/$BRANCH/tutorials/text_processing/images/pipeline.png -O images/pipeline.png"
      ],
      "id": "4nf8sui349co",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "homeless-richardson"
      },
      "source": [
        "import os\n",
        "import wget\n",
        "import pynini\n",
        "import nemo_text_processing"
      ],
      "id": "homeless-richardson",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "interracial-listing"
      },
      "source": [
        "# Task Description\n",
        "\n",
        "Inverse text normalization (ITN) is a part of the Automatic Speech Recognition (ASR) post-processing pipeline. \n",
        "\n",
        "ITN is the task of converting the raw spoken output of the ASR model into its written form to improve the text readability. For example, `in nineteen seventy` should be changed to `in 1975` and `one hundred and twenty three dollars` to `$123`."
      ],
      "id": "interracial-listing"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "desirable-words"
      },
      "source": [
        "# NeMo Inverse Text Normalization\n",
        "\n",
        "NeMo ITN is based on weighted finite-state\n",
        "transducer (WFST) grammars. The tool uses [`Pynini`](https://github.com/kylebgorman/pynini) to construct WFSTs, and the created grammars can be exported and integrated into [`Sparrowhawk`](https://github.com/google/sparrowhawk) (an open-source version of [The Kestrel TTS text normalization system](https://www.cambridge.org/core/journals/natural-language-engineering/article/abs/kestrel-tts-text-normalization-system/F0C18A3F596B75D83B75C479E23795DA)) for production. The NeMo ITN tool can be seen as a Python extension of `Sparrowhawk`. \n",
        "\n",
        "Currently, NeMo ITN provides support for English and the following semiotic classes from the [Google Text normalization dataset](https://www.kaggle.com/richardwilliamsproat/text-normalization-for-english-russian-and-polish):\n",
        "DATE, CARDINAL, MEASURE, DECIMAL, ORDINAL, MONEY, TIME, PLAIN. \n",
        "We additionally added the class `WHITELIST` for all whitelisted tokens whose verbalizations are directly looked up from a user-defined list.\n",
        "\n",
        "The toolkit is modular, easily extendable, and can be adapted to other languages and tasks like [text normalization](https://github.com/NVIDIA/NeMo/blob/main/tutorials/text_processing/Text_Normalization.ipynb). The Python environment enables an easy combination of text covering grammars with NNs. \n",
        "\n",
        "The rule-based system is divided into a classifier and a verbalizer following  [Google's Kestrel](https://www.researchgate.net/profile/Richard_Sproat/publication/277932107_The_Kestrel_TTS_text_normalization_system/links/57308b1108aeaae23f5cc8c4/The-Kestrel-TTS-text-normalization-system.pdf) design: the classifier is responsible for detecting and classifying semiotic classes in the underlying text, the verbalizer the verbalizes the detected text segment. \n",
        "\n",
        "The overall NeMo ITN pipeline from development in `Pynini` to deployment in `Sparrowhawk` is shown below:\n",
        "![alt text](images/deployment.png \"Inverse Text Normalization Pipeline\")"
      ],
      "id": "desirable-words"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "military-radius"
      },
      "source": [
        "# Quick Start\n",
        "\n",
        "## Add ITN to your Python ASR post-processing workflow\n",
        "\n",
        "ITN is a part of the `nemo_text_processing` package which is installed with `nemo_toolkit`. Installation instructions could be found [here](https://github.com/NVIDIA/NeMo/tree/main/README.rst)."
      ],
      "id": "military-radius"
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "limiting-genesis"
      },
      "source": [
        "from nemo_text_processing.inverse_text_normalization.inverse_normalize import InverseNormalizer\n",
        "\n",
        "inverse_normalizer = InverseNormalizer()\n",
        "\n",
        "raw_text = \"we paid one hundred and twenty three dollars for this desk, and this.\"\n",
        "inverse_normalizer.inverse_normalize(raw_text, verbose=False)"
      ],
      "id": "limiting-genesis",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "downtown-inventory"
      },
      "source": [
        "In the above cell, `one hundred and twenty three dollars` would be converted to `$123`, and the rest of the words remain the same.\n",
        "\n",
        "## Run Inverse Text Normalization on an input from a file\n",
        "\n",
        "Use `run_predict.py` to convert a spoken text from a file `INPUT_FILE` to a written format and save the output to `OUTPUT_FILE`. Under the hood, `run_predict.py` is calling `inverse_normalize()` (see the above section)."
      ],
      "id": "downtown-inventory"
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "streaming-butterfly"
      },
      "source": [
        "# If you're running the notebook locally, update the NEMO_TEXT_PROCESSING_PATH below\n",
        "# In Colab, a few required scripts will be downloaded from NeMo github\n",
        "\n",
        "NEMO_TOOLS_PATH = '<UPDATE_PATH_TO_NeMo_root>/nemo_text_processing/inverse_text_normalization'\n",
        "DATA_DIR = 'data_dir'\n",
        "os.makedirs(DATA_DIR, exist_ok=True)\n",
        "\n",
        "if 'google.colab' in str(get_ipython()):\n",
        "    NEMO_TOOLS_PATH = '.'\n",
        "\n",
        "    required_files = ['run_predict.py',\n",
        "                      'run_evaluate.py']\n",
        "    for file in required_files:\n",
        "        if not os.path.exists(file):\n",
        "            file_path = 'https://raw.githubusercontent.com/NVIDIA/NeMo/' + BRANCH + '/nemo_text_processing/inverse_text_normalization/' + file\n",
        "            print(file_path)\n",
        "            wget.download(file_path)\n",
        "elif not os.path.exists(NEMO_TOOLS_PATH):\n",
        "      raise ValueError(f'update path to NeMo root directory')\n",
        "\n",
        "INPUT_FILE = f'{DATA_DIR}/test.txt'\n",
        "OUTPUT_FILE = f'{DATA_DIR}/test_itn.txt'\n",
        "\n",
        "! echo \"on march second twenty twenty\" > $DATA_DIR/test.txt\n",
        "! python $NEMO_TOOLS_PATH/run_predict.py --input=$INPUT_FILE --output=$OUTPUT_FILE"
      ],
      "id": "streaming-butterfly",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "cordless-louisiana"
      },
      "source": [
        "# check that the raw text was indeed converted to the written form\n",
        "! cat $OUTPUT_FILE"
      ],
      "id": "cordless-louisiana",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "italic-parish"
      },
      "source": [
        "## Run evaluation\n",
        "\n",
        "[Google Text normalization dataset](https://www.kaggle.com/richardwilliamsproat/text-normalization-for-english-russian-and-polish) consists of 1.1 billion words of English text from Wikipedia, divided across 100 files. The normalized text is obtained with [The Kestrel TTS text normalization system](https://www.cambridge.org/core/journals/natural-language-engineering/article/abs/kestrel-tts-text-normalization-system/F0C18A3F596B75D83B75C479E23795DA)).\n",
        "\n",
        "Although a large fraction of this dataset can be reused for ITN by swapping input with output, the dataset is not bijective. \n",
        "\n",
        "For example: `1,000 -> one thousand`, `1000 -> one thousand`, `3:00pm -> three p m`, `3 pm -> three p m` are valid data samples for normalization but the inverse does not hold for ITN. \n",
        "\n",
        "We used regex rules to disambiguate samples where possible, see `nemo_text_processing/inverse_text_normalization/clean_eval_data.py`.\n",
        "\n",
        "To run evaluation, the input file should follow the Google Text normalization dataset format. That is, every line of the file needs to have the format `<semiotic class>\\t<unnormalized text>\\t<self>` if it's trivial class or `<semiotic class>\\t<unnormalized text>\\t<normalized text>` in case of a semiotic class.\n",
        "\n",
        "Example evaluation run: \n",
        "\n",
        "`python run_evaluate.py \\\n",
        "        --input=./en_with_types/output-00001-of-00100 \\\n",
        "        [--cat CATEGORY] \\\n",
        "        [--filter]`\n",
        "        \n",
        "        \n",
        "Use `--cat` to specify a `CATEGORY` to run evaluation on (all other categories are going to be excluded from evaluation). With the option `--filter`, the provided data will be cleaned to avoid disambiguates (use `clean_eval_data.py` to clean up the data upfront)."
      ],
      "id": "italic-parish"
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "intimate-astronomy"
      },
      "source": [
        "eval_text = \"\"\"PLAIN\\ton\\t<self>\n",
        "DATE\\t22 july 2012\\tthe twenty second of july twenty twelve\n",
        "PLAIN\\tthey\\t<self>\n",
        "PLAIN\\tworked\\t<self>\n",
        "PLAIN\\tuntil\\t<self>\n",
        "TIME\\t12:00\\ttwelve o'clock\n",
        "<eos>\\t<eos>\n",
        "\"\"\"\n",
        "\n",
        "INPUT_FILE_EVAL = f'{DATA_DIR}/test_eval.txt'\n",
        "\n",
        "with open(INPUT_FILE_EVAL, 'w') as f:\n",
        "    f.write(eval_text)\n",
        "! cat $INPUT_FILE_EVAL"
      ],
      "id": "intimate-astronomy",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "corporate-contest"
      },
      "source": [
        "! python $NEMO_TOOLS_PATH/run_evaluate.py --input=$INPUT_FILE_EVAL"
      ],
      "id": "corporate-contest",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "raised-exemption"
      },
      "source": [
        "`run_evaluate.py` call will output both **sentence level** and **token level** accuracies. \n",
        "For our example, the expected output is the following:\n",
        "\n",
        "```\n",
        "Loading training data: data_dir/test_eval.txt\n",
        "Sentence level evaluation...\n",
        "- Data: 1 sentences\n",
        "100% 1/1 [00:00<00:00, 58.42it/s]\n",
        "- Denormalized. Evaluating...\n",
        "- Accuracy: 1.0\n",
        "Token level evaluation...\n",
        "- Token type: PLAIN\n",
        "  - Data: 4 tokens\n",
        "100% 4/4 [00:00<00:00, 504.73it/s]\n",
        "  - Denormalized. Evaluating...\n",
        "  - Accuracy: 1.0\n",
        "- Token type: DATE\n",
        "  - Data: 1 tokens\n",
        "100% 1/1 [00:00<00:00, 118.95it/s]\n",
        "  - Denormalized. Evaluating...\n",
        "  - Accuracy: 1.0\n",
        "- Token type: TIME\n",
        "  - Data: 1 tokens\n",
        "100% 1/1 [00:00<00:00, 230.44it/s]\n",
        "  - Denormalized. Evaluating...\n",
        "  - Accuracy: 1.0\n",
        "- Accuracy: 1.0\n",
        " - Total: 6 \n",
        "\n",
        "Class      | Num Tokens | Denormalization\n",
        "sent level | 1          | 1.0  \n",
        "PLAIN      | 4          | 1.0  \n",
        "DATE       | 1          | 1.0  \n",
        "CARDINAL   | 0          | 0    \n",
        "LETTERS    | 0          | 0    \n",
        "VERBATIM   | 0          | 0    \n",
        "MEASURE    | 0          | 0    \n",
        "DECIMAL    | 0          | 0    \n",
        "ORDINAL    | 0          | 0    \n",
        "DIGIT      | 0          | 0    \n",
        "MONEY      | 0          | 0    \n",
        "TELEPHONE  | 0          | 0    \n",
        "ELECTRONIC | 0          | 0    \n",
        "FRACTION   | 0          | 0    \n",
        "TIME       | 1          | 1.0  \n",
        "ADDRESS    | 0          | 0    \n",
        "```"
      ],
      "id": "raised-exemption"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "imported-literacy"
      },
      "source": [
        "# C++ deployment\n",
        "\n",
        "The instructions on how to export `Pynini` grammars and to run them with `Sparrowhawk`, could be found at [NeMo/tools/text_processing_deployment](https://github.com/NVIDIA/NeMo/tree/main/tools/text_processing_deployment)."
      ],
      "id": "imported-literacy"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "bronze-nerve"
      },
      "source": [
        "# WFST and Common Pynini Operations\n",
        "\n",
        "Finite-state acceptor (or FSA) is a finite state automaton that has a finite number of states and no output. FSA either accepts (when the matching patter is found) or rejects a string (no match is found). "
      ],
      "id": "bronze-nerve"
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "heavy-distance"
      },
      "source": [
        "print([byte for byte in bytes('fst', 'utf-8')])\n",
        "\n",
        "# create an acceptor from a string\n",
        "pynini.accep('fst')"
      ],
      "id": "heavy-distance",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "brave-avatar"
      },
      "source": [
        "Here `0` - is a start note, `1` and `2` are the accept nodes, while `3` is a finite state.\n",
        "By default (token_type=\"byte\", `Pynini` interprets the string as a sequence of bytes, assigning one byte per arc. \n",
        "\n",
        "A finite state transducer (FST) not only matches the pattern but also produces output according to the defined transitions."
      ],
      "id": "brave-avatar"
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "paperback-female"
      },
      "source": [
        "# create an FST\n",
        "pynini.cross('fst', 'FST')"
      ],
      "id": "paperback-female",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "blond-hypothetical"
      },
      "source": [
        "Pynini supports the following operations:\n",
        "\n",
        "- `closure` - Computes concatenative closure.\n",
        "- `compose` - Constructively composes two FSTs.\n",
        "- `concat` - Computes the concatenation (product) of two FSTs.\n",
        "- `difference` - Constructively computes the difference of two FSTs.\n",
        "- `invert`  - Inverts the FST's transduction.\n",
        "- `optimize` - Performs a generic optimization of the FST.\n",
        "- `project` - Converts the FST to an acceptor using input or output labels.\n",
        "- `shortestpath` - Construct an FST containing the shortest path(s) in the input FST.\n",
        "- `union`- Computes the union (sum) of two or more FSTs.\n",
        "\n",
        "\n",
        "The list of most commonly used `Pynini` operations could be found [https://github.com/kylebgorman/pynini/blob/master/CHEATSHEET](https://github.com/kylebgorman/pynini/blob/master/CHEATSHEET). \n",
        "\n",
        "Pynini examples could be found at [https://github.com/kylebgorman/pynini/tree/master/pynini/examples](https://github.com/kylebgorman/pynini/tree/master/pynini/examples).\n",
        "Use `help()` to explore the functionality. For example:"
      ],
      "id": "blond-hypothetical"
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "arctic-firewall"
      },
      "source": [
        "help(pynini.union)"
      ],
      "id": "arctic-firewall",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "continued-optimum"
      },
      "source": [
        "# NeMo ITN API"
      ],
      "id": "continued-optimum"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "twenty-terrorist"
      },
      "source": [
        "NeMo ITN defines the following APIs that are called in sequence:\n",
        "\n",
        "- `find_tags() + select_tag()` - creates a linear automaton from the input string and composes it with the final classification WFST, which transduces numbers and inserts semantic tags.  \n",
        "- `parse()` - parses the tagged string into a list of key-value items representing the different semiotic tokens.\n",
        "- `generate_permutations()` - takes the parsed tokens and generates string serializations with different reorderings of the key-value items. This is important since WFSTs can only process input linearly, but the word order can change from spoken to written form (e.g., `three dollars -> $3`). \n",
        "- `find_verbalizer() + select_verbalizer` - takes the intermediate string representation and composes it with the final verbalization WFST, which removes the tags and returns the written form.  \n",
        "\n",
        "![alt text](images/pipeline.png \"Inverse Text Normalization Pipeline\")"
      ],
      "id": "twenty-terrorist"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "twenty-charles"
      },
      "source": [
        "# References and Further Reading:\n",
        "\n",
        "\n",
        "- [Zhang, Yang, Bakhturina, Evelina, Gorman, Kyle and Ginsburg, Boris. \"NeMo Inverse Text Normalization: From Development To Production.\" (2021)](https://arxiv.org/abs/2104.05055)\n",
        "- [Ebden, Peter, and Richard Sproat. \"The Kestrel TTS text normalization system.\" Natural Language Engineering 21.3 (2015): 333.](https://www.cambridge.org/core/journals/natural-language-engineering/article/abs/kestrel-tts-text-normalization-system/F0C18A3F596B75D83B75C479E23795DA)\n",
        "- [Gorman, Kyle. \"Pynini: A Python library for weighted finite-state grammar compilation.\" Proceedings of the SIGFSM Workshop on Statistical NLP and Weighted Automata. 2016.](https://www.aclweb.org/anthology/W16-2409.pdf)\n",
        "- [Mohri, Mehryar, Fernando Pereira, and Michael Riley. \"Weighted finite-state transducers in speech recognition.\" Computer Speech & Language 16.1 (2002): 69-88.](https://cs.nyu.edu/~mohri/postscript/csl01.pdf)"
      ],
      "id": "twenty-charles"
    }
  ]
}